Effects of JPEG Compression on Vision Transformer Image Classification for Encryption-then-Compression Images

This paper evaluates the effects of JPEG compression on image classification using the Vision Transformer (ViT). In recent years, many studies have been carried out to classify images in the encrypted domain for privacy preservation. Previously, the authors proposed an image classification method that encrypts both a trained ViT model and test images. Here, an encryption-then-compression system was employed to encrypt the test images, and the ViT model was preliminarily trained by plain images. The classification accuracy in the previous method was exactly equal to that without any encryption for the trained ViT model and test images. However, even though the encrypted test images can be compressible, the practical effects of JPEG, which is a typical lossy compression method, have not been investigated so far. In this paper, we extend our previous method by compressing the encrypted test images with JPEG and verify the classification accuracy for the compressed encrypted-images. Through our experiments, we confirm that the amount of data in the encrypted images can be significantly reduced by JPEG compression, while the classification accuracy of the compressed encrypted-images is highly preserved. For example, when the quality factor is set to 85, this paper shows that the classification accuracy can be maintained at over 98% with a more than 90% reduction in the amount of image data. Additionally, the effectiveness of JPEG compression is demonstrated through comparison with linear quantization. To the best of our knowledge, this is the first study to classify JPEG-compressed encrypted images without sacrificing high accuracy. Through our study, we have come to the conclusion that we can classify compressed encrypted-images without degradation to accuracy.


Introduction
Significant advances in deep-learning technology have made it possible to automate and accelerate various tasks. In particular, image classification has been put to practical use in a variety of applications, such as face recognition and anomaly detection. In addition, cloud services have become popular among common organizations and individuals for the purpose of reducing costs, facilitating data sharing, and so on. For these reasons, image classification tasks are increasingly being accomplished through cloud servers. To utilize a model on a cloud server, a user should transmit test images to the server.
However, cloud servers are generally not reliable, and thus test images are under threat of being compromised outside of the servers. As a result, the copyright and privacy information in the test images might be disclosed to third parties. Additionally, the user generally needs to classify a large number of images; in other words, a large amount of image data should be transmitted to the server in succession. Therefore, it is desirable to minimize the data amount of the test images.
Many compression methods have been proposed to reduce the data amount of images. Compression methods can be classified into two categories: lossless and lossy methods. In general, lossy methods can more efficiently reduce the data amount compared with lossless methods. A typical lossy method is JPEG, which is one of the image compression standards. On the basis of human visual features, JPEG compression significantly reduces the information of high-frequency components and commonly applies 4:2:0 downsampling, i.e., horizontal and vertical downsampling of chrominance. Consequently, we can notably reduce the data amount while preserving high image quality. It is noted that JPEG compression can adjust the image quality and data amount by varying the quality factor.
In recent years, there has been a great amount of effort to develop secure image-classification systems with copyright and privacy protection for images. Federated learning is one technique that can be used in developing such systems [1-3]. Multiple clients individually train a single model by using their own data, while a central server integrates the parameters trained by each client. This technique can protect training images but not test images. On another front, secure computation is also drawing attention. This technique can directly apply computational operations to encrypted data. A large number of methods have been proposed that automatically classify data encrypted with secure computation [4-6]. These methods can protect test data; however, the encrypted data can hardly be compressed. Even if the encrypted data is successfully compressed, it is difficult to decrypt the data.
Another approach for protecting copyright and privacy information in test images is to conceal the visual information. Image encryption is a typical technique for concealing visual information, and image-encryption methods have been actively studied to train encrypted images using deep neural networks [3,7-17]. The method in [3] combines federated learning with image encryption for test images. Encrypted image classification via a cloud server assumes that a user encrypts test images and transmits the encrypted images to a server. Thus, it is desirable to be able to compress the encrypted images in terms of the transmission efficiency; however, most such methods [3,9-16] do not consider image compression. Aprilpyone et al. employed the encryption-then-compression (EtC) system [18] as an image encryption algorithm so that the encrypted images (hereafter, EtC images) possess a high compression performance [8]. Some other methods protect visual information using machine learning instead of encryption and classify protected images [19,20]. The methods [8-15,19,20], however, degrade the classification accuracy due to the protection of visual information.
The method in [8] employs the Vision Transformer (ViT) [21] and ConvMixer [22], which are called isotropic networks, as image-classification models. They are known to provide a higher classification accuracy compared with convolutional neural networks, which are the conventional mainstream image-classification models. Kiya et al. focused on the properties of ViT to maintain the classification accuracy for encrypted images [16]. This method prepares a series of encryption keys (hereafter, key set) and uses it to encrypt not only test images but also a trained ViT model. The encrypted ViT model is eventually suitable for the encrypted images. This is the first study that perfectly preserves the classification accuracy for encrypted images. However, the image encryption process in this method employs a pixel-wise transformation, so the encrypted images can hardly be compressed.
As an extension of the method [16], we previously introduced an EtC system for the image encryption process [17]. The EtC system is based on a block-wise transformation, and thus the EtC images can maintain high compression performance. Further, this method does not cause any degradation to the classification accuracy for EtC images by using a model encryption algorithm that corresponds to the EtC system. Therefore, we not only successfully avoid any degradation to the classification accuracy but also compress the encrypted images. In [17], we surveyed the performance of lossless compression using JPEG-LS [23].
On the basis of our previous method [17], this paper, for the first time, investigates the effects of JPEG compression, which is a widely used lossy compression standard, on the classification accuracy. In our experiments, we confirm that a high classification accuracy can be preserved even for JPEG-compressed EtC images. Moreover, this paper verifies the effectiveness of JPEG compression in terms of classification and compression performance compared with linear quantization.
In this paper, we demonstrate that JPEG noise added to the high-frequency component barely degrades the accuracy of ViT classification. To the best of our knowledge, this is the only study that successfully compresses encrypted images using the JPEG lossy standard and classifies the compressed encrypted-images with very little degradation to accuracy. Through a series of studies, we reach the conclusion that compressed EtC images can be classified without degradation to accuracy.

Preparation
We give an overview of ViT [21] and summarize our previous method [17] in this section. We previously proposed an image classification method using ViT with novel advantages; copyrights for both a trained ViT model and test images can be protected simultaneously without any decrease in the accuracy of classification, and the test images are effectively compressed using lossless image compression standards. This paper verifies the effects of JPEG lossy compression on the classification accuracy of ViT on the basis of our previous method.

Vision Transformer
An attention mechanism dynamically identifies the locations that should be focused on within input data. This mechanism has notably contributed to enhancing accuracy in deep learning. In the field of natural-language processing, the transformer, which implements an attention mechanism, has enhanced the performance of machine translation [24]. By using the transformer for image classification tasks, ViT has achieved higher accuracy than conventional methods, such as convolutional neural networks. Figure 1 shows an overview of ViT.
ViT receives an input image $x \in \mathbb{R}^{H \times W \times C}$ and outputs a prediction class $y$ for the image. Here, $H$, $W$, and $C$ denote the height, width, and number of channels of the input image, respectively. First, ViT divides $x$ into patches $x_p^{\alpha} \in \mathbb{R}^{P \times P \times C}$, where $P$ is the patch size, $\alpha \in \{1, 2, \cdots, N\}$, and $N$ is the number of patches. Here, we define a patch set $x_p \in \mathbb{R}^{N \times P \times P \times C}$ as

$$x_p = (x_p^{1} \; x_p^{2} \; \cdots \; x_p^{N}). \tag{1}$$

Each patch is then flattened to generate $x_{fp}^{\alpha} \in \mathbb{R}^{P^2 C}$ with a single dimension. We call $x_{fp}^{\alpha}$ a flattened patch. $x_{fp}^{\alpha}$ is linearly transformed into a vector with $D$ dimensions using a matrix $E \in \mathbb{R}^{(P^2 C) \times D}$, where $D$ is the number of vector dimensions received by the transformer encoder. Further, a class token $x_{class} \in \mathbb{R}^{D}$ is located at the head of the sequence of vectors. The position information $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is then embedded into the sequence of vectors so as to generate a matrix $z_0 \in \mathbb{R}^{(N+1) \times D}$ that is input to the transformer encoder. In summary, $z_0$ is represented as

$$z_0 = [x_{class}; \; x_{fp}^{1} E; \; x_{fp}^{2} E; \; \cdots; \; x_{fp}^{N} E] + E_{pos}. \tag{2}$$

The transformer encoder contains $L$ layers, and each layer consists of multi-head self-attention (MSA), multi-layer perceptron (MLP), and layer normalization (LN). The transformer encoder receives and transforms $z_0$ as

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \qquad l \in \{1, 2, \cdots, L\}. \tag{3}$$

Here, $z_l$ denotes the output from the $l$-th layer; thus, $z_L$ means the output from the final layer of the transformer encoder. Finally, $y$ is derived from $z_L^{0}$, which is the head row in $z_L$:

$$y = \mathrm{LN}(z_L^{0}). \tag{4}$$

From Equation (2), it is clear that the $\beta$-th pixel in every flattened patch is transformed by the $\beta$-th row in $E$, where $\beta \in \{1, 2, \cdots, P^2 C\}$. On the other hand, the $(\alpha + 1)$-th row in $E_{pos}$ is added to $x_{fp}^{\alpha} E$. By focusing on these properties of ViT, the authors previously proposed a model-encryption method that corresponds to EtC images [17]. Our previous method can classify EtC images without any degradation to the classification accuracy. We outline our previous method in the following section.
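As a concrete illustration of Equation (2), the following NumPy sketch builds z_0 for a toy image. The function name `patch_embed`, the toy sizes, and the random parameters are assumptions made for illustration; they are not the settings or the code used in this paper.

```python
import numpy as np

def patch_embed(x, E, x_class, E_pos, P):
    """Build z_0 from an image x of shape (H, W, C), following Equation (2)."""
    H, W, C = x.shape
    # Divide x into N = (H/P) * (W/P) patches of shape (P, P, C), in raster order.
    patches = (x.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, P, P, C))
    # Flatten each patch into a vector of length P * P * C.
    x_fp = patches.reshape(patches.shape[0], -1)
    # Project with E, put the class token at the head, add position information.
    tokens = np.vstack([x_class[None, :], x_fp @ E])
    return tokens + E_pos

rng = np.random.default_rng(0)
H = W = 32
C, P, D = 3, 16, 8                      # toy sizes (assumed)
N = (H // P) * (W // P)                 # number of patches
x = rng.random((H, W, C))
E = rng.standard_normal((P * P * C, D))
x_class = rng.standard_normal(D)
E_pos = rng.standard_normal((N + 1, D))

z0 = patch_embed(x, E, x_class, E_pos, P)
print(z0.shape)  # (5, 8), i.e., (N + 1, D)
```

Row alpha + 1 of z_0 depends only on patch alpha, on E, and on row alpha + 1 of E_pos, which is the property the model encryption relies on.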

Previous Classification Method for EtC Images through Encrypted ViT Model
This section describes our previous method that enables us to protect both a trained ViT model and test images while preserving high classification accuracy [17]. The test images can be efficiently compressed using lossless image compression standards. Figure 2 shows a block diagram of the previous method. Note that any images used in this method have RGB color channels. In this method, we assume a model in which there are a single user, a provider, and a trusted third party. First, the trusted third party trains a ViT model with training images in the plain domain.
The parameters E and E pos in the trained ViT model are then transformed by a key set K = {K 1 , K 2 , K 3 , K 4 , K 5 } to encrypt the trained model. This process is called model encryption. The trusted third party transmits the encrypted model to the provider and the key set K to the user. The user encrypts test images using the EtC system [18] with K. This process will hereafter be called image encryption. The EtC images are subsequently transmitted to the provider. The provider obtains the classification results for the EtC images through the encrypted model and finally sends the classification results back to the user. The image and model-encryption procedures are detailed in Sections 3.2 and 3.3, respectively.
In this system, the user transmits the EtC images to the provider to obtain the classification results. Thus, the encrypted model is not disclosed to anyone outside the provider. This means that no one outside the provider can access and manipulate the encrypted model. The user, therefore, cannot decrypt the encrypted model despite having K. On the other hand, it is difficult to decrypt the EtC images without using K. The trusted third party does not provide K but the encrypted model itself to the provider, so the provider cannot decrypt the EtC images and expose the image content. Therefore, this system prevents unauthorized persons/organizations from obtaining plain test images and a plain model.  Using the previous method, we can obtain a suitable model for EtC images by encrypting the trained model. Accordingly, the classification results for EtC images through an encrypted model are identical to those for plain test images through a plain model. Furthermore, EtC images are expected to have a high compression performance since the encryption system employs a block-wise transformation. The previous method demonstrated that JPEG-LS compression [23] could significantly reduce the data amount of EtC images.
In contrast, JPEG is the most popular standard for lossy image compression. Thus, in this paper, we examine the effects of JPEG compression for EtC images on the classification accuracy and further assess the tradeoff between the accuracy and compression performance. To the best of our knowledge, this is the first study on image classification that maintains high classification accuracy against JPEG compression.

Evaluation of JPEG-Compression Effects on the Classification Results
This paper extends the previous method [17] to verify the effects of JPEG compression for EtC images on the classification results. This section first outlines evaluation schemes to investigate the JPEG-compression effects and then details the image and model-encryption procedure. Finally, we describe the evaluation metrics in our experiments. Figure 3 illustrates the flows of our evaluation schemes. We prepared two types of schemes to elaborately examine the effects of JPEG compression. Hereafter, the schemes shown in Figure 3a,b will be called evaluation schemes A and B, respectively. Note that all images used in this paper have RGB color channels.

Overview
First, a ViT model is trained by using plain training images in scheme A. In scheme B, the plain training images are preliminarily compressed by JPEG, and the ViT model is trained by using the compressed images (JPEG training images, hereafter). The flow after model training is the same between the two evaluation schemes. A trusted third party encrypts the trained model with a key set $K = \{K_1, K_2, K_3, K_4, K_5\}$, and $K$ and the encrypted model are transmitted to a user and a provider, respectively. The user encrypts test images using the EtC system [18] and compresses the EtC images by JPEG. The JPEG-compressed EtC images are then sent to the provider to be classified. The provider classifies each JPEG-compressed EtC image through the encrypted model and finally returns the classification results to the user.
In scheme A, test images encrypted by the EtC system are compressed by JPEG. Thus, we verify the compression effects for test images through comparison with our previous method [17]. In comparison, both training and test images are compressed by JPEG in scheme B. Through a comparison between schemes A and B, we examine the compression effects for training images on the classification of JPEG-compressed test images.

Image Encryption
Figure 4 shows the image-encryption procedure. This encryption algorithm is an extension of the block-based image-encryption method [18], which is one of the EtC systems. We preliminarily prepare a key set $K = \{K_1, K_2, K_3, K_4, K_5\}$ so as to encrypt an input image. Note that $K_1$, $K_2$, and $K_3$ are key sets consisting of three keys $\{K_q^R, K_q^G, K_q^B\}$ ($q = 1, 2, 3$), and $K_4$ and $K_5$ represent single keys. The image-encryption procedure is described as follows.
Step i-1: Divide an input image into main blocks, and further divide each main block into sub blocks.
Step i-2: Translocate sub blocks within each main block using $K_1$.
Step i-3: Rotate and flip each sub block using $K_2$.
Step i-4: Apply a negative-positive transformation to each sub block using $K_3$.
Step i-5: Normalize all pixels.
Step i-6: Shuffle the R, G, and B components in each sub block using $K_4$.
Step i-7: Translocate main blocks using $K_5$.
Step i-8: Integrate all of the sub and main blocks.
In Step i-1, the input image is divided into main and sub blocks as shown in Figure 5. We call Steps i-2 to i-6 sub-block encryption and Step i-7 main-block encryption.
Sub-block encryption includes five operations. Each operation, except normalization, is a sub-block-wise transformation in each main block, and K 1 , K 2 , K 3 , and K 4 are shared among all the main blocks. K q (q = 1, 2, 3) consist of three single keys K R q , K G q , and K B q corresponding to the R, G, and B components, respectively. Thus, each component can be transformed independently when K R q , K G q , and K B q are different from each other. In contrast, all the components are transformed commonly when the three keys are identical.
The former is called independent transformation, and the latter is called common transformation in this paper. The main-block encryption consists of a single operation, where the main blocks are translocated. Since K 5 for the main-block encryption is not a key set but a single key, the R, G, and B components should be translocated commonly. The encryption algorithm transforms an input image while preserving the pixel-to-pixel correlation in each sub block, and so the encrypted image is expected to be highly compressed.
Before we detail the sub-block and main-block encryptions, symbols are preliminarily defined as follows.
• $H$ and $W$: the height and width of an image.
• $x \in \{0, 1, \cdots, 255\}^{H \times W \times 3}$: an input image.
• $S_{mb}$ and $S_{sb}$: the main-block and sub-block sizes.
• $N_{mb}$: the number of main blocks.
• $N_{sb}$: the number of sub blocks within each main block.
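The block division in Step i-1 and the integration in Step i-8 can be sketched with array reshaping. The function names `to_sub_blocks` and `from_sub_blocks`, the raster ordering of blocks, and the five-dimensional layout (main block, sub block, height, width, channel) are assumptions of this sketch, chosen to match the index order (m, s, h, w, c) used in the following subsections.

```python
import numpy as np

def to_sub_blocks(x, S_mb, S_sb):
    """Rearrange an (H, W, C) image into shape (N_mb, N_sb, S_sb, S_sb, C)."""
    H, W, C = x.shape
    n = S_mb // S_sb                      # sub blocks per main-block side
    # Axes after reshape: (mb_row, sb_row, h, mb_col, sb_col, w, c).
    x_sb = x.reshape(H // S_mb, n, S_sb, W // S_mb, n, S_sb, C)
    # Reorder to (mb_row, mb_col, sb_row, sb_col, h, w, c).
    x_sb = x_sb.transpose(0, 3, 1, 4, 2, 5, 6)
    return x_sb.reshape(-1, n * n, S_sb, S_sb, C)

def from_sub_blocks(x_sb, H, W, S_mb, S_sb):
    """Inverse of to_sub_blocks: integrate sub and main blocks into an image."""
    n = S_mb // S_sb
    C = x_sb.shape[-1]
    x = x_sb.reshape(H // S_mb, W // S_mb, n, n, S_sb, S_sb, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(H, W, C)

# A 224 x 224 RGB image with 16 x 16 main blocks and 8 x 8 sub blocks.
x = np.arange(224 * 224 * 3).reshape(224, 224, 3)
x_sb = to_sub_blocks(x, S_mb=16, S_sb=8)
print(x_sb.shape)  # (196, 4, 8, 8, 3)
```

With this layout, every later operation becomes an index manipulation along one or two axes of the array.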

Sub-Block Translocation
We first translocate sub blocks within each main block by using $K_1$. Vectors $v_i$ ($i \in \{1, 2, 3\}$) are generated by $K_1^R$, $K_1^G$, and $K_1^B$, respectively. Each vector $v_i$ is represented as

$$v_i = [v_{i,1}, v_{i,2}, \cdots, v_{i,N_{sb}}], \tag{5}$$

where $v_{i,j}, v_{i,\hat{j}} \in \{1, 2, \ldots, N_{sb}\}$, and $v_{i,j} \neq v_{i,\hat{j}}$ if $j \neq \hat{j}$. The second dimension of $x_{sb}$ denotes a sub-block number; thus, the sub blocks are translocated by replacing their numbers with the entries of $v_i$ (Equation (6)).
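A minimal sketch of the sub-block translocation is given below. The key schedule (`permutation_from_key`, a permutation drawn from a key-seeded pseudo-random generator), the 0-based indices, and the direction of the mapping (output sub block s is taken from input sub block v[s]) are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

def permutation_from_key(key, n):
    """Hypothetical key schedule: a key-seeded permutation of 0..n-1."""
    return np.random.default_rng(key).permutation(n)

def translocate_sub_blocks(x_sb, keys):
    """Permute sub blocks within every main block, one key per color channel.

    x_sb has shape (N_mb, N_sb, S_sb, S_sb, 3); keys = (K_R, K_G, K_B).
    """
    out = np.empty_like(x_sb)
    n_sb = x_sb.shape[1]
    for c, key in enumerate(keys):
        v = permutation_from_key(key, n_sb)
        # The same vector v is shared among all the main blocks.
        out[..., c] = x_sb[..., c][:, v]
    return out

x_sb = np.arange(2 * 4 * 2 * 2 * 3).reshape(2, 4, 2, 2, 3)    # toy data
common = translocate_sub_blocks(x_sb, keys=(7, 7, 7))        # identical keys
independent = translocate_sub_blocks(x_sb, keys=(7, 8, 9))   # different keys
```

Passing three identical keys gives the common transformation, and passing three different keys gives the independent transformation described above.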

Block Rotation and Block Flipping
Next, we rotate and flip each sub block using $K_2$. As shown in Figure 6, there are eight transformation patterns for each sub block. Three vectors $r_i$ ($i \in \{1, 2, 3\}$) are derived from $K_2^R$, $K_2^G$, and $K_2^B$, respectively. Each vector $r_i$ is denoted by

$$r_i = [r_{i,1}, r_{i,2}, \cdots, r_{i,N_{sb}}], \tag{7}$$

where $r_{i,j} \in \{1, 2, \ldots, 8\}$. The third and fourth dimensions of $x_{sb}^{(1)}$ represent the position in the height and width directions in each sub block, respectively. Therefore, each sub block is rotated and flipped by translocating pixels within the sub block depending on $r_i$ (Equation (8)), where the reversed positions are $R_h = S_{sb} - h + 1$ and $R_w = S_{sb} - w + 1$.
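The eight patterns can be generated by combining quarter-turn rotations with an optional horizontal flip, as in the sketch below. The numbering of the patterns (which value of r selects which pattern) is an assumption of this sketch, and the function names are illustrative.

```python
import numpy as np

def transform_sub_block(block, r):
    """Apply one of eight rotation/flip patterns to a square block; r in 1..8."""
    k, flip = divmod(r - 1, 2)        # k quarter-turns, then an optional flip
    out = np.rot90(block, k)
    return np.fliplr(out) if flip else out

def inverse_sub_block(block, r):
    """Undo transform_sub_block for the same r."""
    k, flip = divmod(r - 1, 2)
    out = np.fliplr(block) if flip else block
    return np.rot90(out, -k)

block = np.arange(16).reshape(4, 4)   # toy 4 x 4 sub block, one color channel
patterns = [transform_sub_block(block, r) for r in range(1, 9)]
```

In this numbering, r = 5 is the 180-degree rotation, which moves the pixel at (h, w) to the reversed position (R_h, R_w).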

[Figure 6: original sub block and the eight patterns of transformation]

Negative-Positive Transformation
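We then apply a negative-positive transformation to each sub block with $K_3$. Vectors $n_i$ ($i \in \{1, 2, 3\}$) are generated using $K_3^R$, $K_3^G$, and $K_3^B$ and given by

$$n_i = [n_{i,1}, n_{i,2}, \cdots, n_{i,N_{sb}}], \tag{9}$$

where $n_{i,j} \in \{1, 2\}$. The negative-positive transformation is conducted on the basis of $n_i$: depending on $n_{i,j}$, every pixel value $p$ in the $j$-th sub block is either retained or replaced with $255 - p$ (Equation (10)).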

Normalization
All pixels in $x_{sb}^{(3)}$ should be normalized as

$$x_{sb}^{(4)}(m, s, h, w, c) = \frac{x_{sb}^{(3)}(m, s, h, w, c) - 255/2}{S}, \tag{11}$$

where $S$ is an arbitrary constant, while $S = 255/2$ in this paper. In the case of $n_{i,j} = 1$ in Equation (10), $x_{sb}^{(4)}(m, s, h, w, c)$ can be expressed in terms of $x_{sb}^{(2)}(m, s, h, w, c)$ by Equation (12); otherwise, it is given by Equation (13). Because $(255 - p) - 255/2 = -(p - 255/2)$ for any pixel value $p$, the two expressions differ only in sign. From Equations (12) and (13), it is clear that the negative-positive transformation with normalization can be regarded as an operation of retaining or flipping the sign of each pixel value. This property prevents a model encryption algorithm from being complex. We detail the algorithm in Section 3.3.3.
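The sign property can be checked numerically for every 8-bit pixel value. This is a small sketch that assumes the normalization subtracts 255/2 before dividing by S, as implied by the sign-flip property; the helper name `normalize` is illustrative.

```python
import numpy as np

S = 255 / 2                               # the constant used in this paper

def normalize(p):
    """Center pixel values at 255/2 and scale by S."""
    return (p - 255 / 2) / S

p = np.arange(256, dtype=np.float64)      # every 8-bit pixel value
retained = normalize(p)                   # pixel value kept as is
inverted = normalize(255 - p)             # pixel value replaced with 255 - p

print(np.allclose(inverted, -retained))   # True
```

After normalization, inverting a pixel value is therefore the same as multiplying the normalized value by -1.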

Main-Block Translocation
Finally, the main blocks are translocated with $K_5$. A vector $k$ obtained by $K_5$ is given by

$$k = [k_1, k_2, \cdots, k_{N_{mb}}], \tag{18}$$

where $k_t, k_{\hat{t}} \in \{1, 2, \ldots, N_{mb}\}$, and $k_t \neq k_{\hat{t}}$ if $t \neq \hat{t}$. The first dimension of $x_{sb}^{(5)}$ represents a main-block number, so we translocate the main blocks by replacing their numbers with $k$:

$$\hat{x}_{sb}(t, s, h, w, c) = x_{sb}^{(5)}(k_t, s, h, w, c). \tag{19}$$

Model Encryption
This section describes the model-encryption procedure. While image encryption can protect visual information, it seriously deteriorates the classification accuracy. The model encryption in this paper not only cancels out the effects but also prevents unauthorized accesses to a trained ViT model by encryption.
We assume that the patch size $P$ in ViT is the same as the main-block size $S_{mb}$ in the image encryption and that the number of patches $N$ is equal to the number of main blocks $N_{mb}$. The patch set $x_p$ has $N \times P \times P \times 3$ dimensions, and the main-block image $x_{mb}$ has $N_{mb} \times S_{mb} \times S_{mb} \times 3$ dimensions; namely, $x_p$ and $x_{mb}$ are identical. Here, we define both $x_{mb}^{\alpha} \in \mathbb{R}^{S_{mb} \times S_{mb} \times 3}$ and $x_{sb}^{\alpha} \in \mathbb{R}^{N_{sb} \times S_{sb} \times S_{sb} \times 3}$ as a single main block. Note that $\alpha \in \{1, 2, \cdots, N\}$, and $N$ is equal to $N_{mb}$, and so $\alpha$ is an index denoting the main-block number. $x_{mb}^{\alpha}$ is a part of $x_{mb}$ without sub-block division, while $x_{sb}^{\alpha}$ is a part of $x_{mb}$ with sub-block division. They are represented as

$$x_{mb} = (x_{mb}^{1} \; x_{mb}^{2} \; \cdots \; x_{mb}^{N_{mb}}), \qquad x_{sb} = (x_{sb}^{1} \; x_{sb}^{2} \; \cdots \; x_{sb}^{N_{mb}}).$$

$x_p$ and $x_{mb}$ are identical, so the patch $x_p^{\alpha}$ and the main block $x_{mb}^{\alpha}$ are treated as one and the same. Therefore, $x_{fp}^{\alpha}$ obtained by flattening $x_p^{\alpha}$ is also derived from flattening $x_{mb}^{\alpha}$. Hereafter, $P$ and $N$ will be denoted as $S_{mb}$ and $N_{mb}$, respectively, for the sake of consistency.
Figure 7 illustrates the model-encryption procedure. One of the purposes of model encryption is to ensure that the classification results are never affected by image encryption. Thus, we transform the parameters $E$ and $E_{pos}$ in the trained model with the key set $K$, which is the same as for the image encryption. Each operation in the model encryption is compatible with each operation in the image encryption. The model-encryption procedure is described as follows.
Step m-1: Transform $E$ to obtain $E_{sb} \in \mathbb{R}^{N_{sb} \times S_{sb} \times S_{sb} \times 3 \times D}$.
Step m-2: Translocate indices in the first dimension of $E_{sb}$ using $K_1$.
Step m-3: Translocate indices in the second and third dimensions of $E_{sb}$ using $K_2$.
Step m-4: Flip or retain the signs of the elements in $E_{sb}$ using $K_3$.
Step m-5: Translocate indices in the fourth dimension of $E_{sb}$ using $K_4$.
Step m-6: Transform $E_{sb}$ into the original dimension of $E$ to derive $\hat{E} \in \mathbb{R}^{(3 \cdot S_{mb} \cdot S_{mb}) \times D}$.
Step m-7: Translocate rows in $E_{pos}$ using $K_5$ to obtain $\hat{E}_{pos} \in \mathbb{R}^{(N_{mb}+1) \times D}$.
Figure 8 illustrates the relationship between a divided image and $E$. We transform $E$ to $E_{mb} \in \mathbb{R}^{S_{mb} \times S_{mb} \times 3 \times D}$ and then obtain $E_{sb}$ in Step m-1. This step allows $E$ to be encrypted directly by using the vectors for the sub-block encryption.
As mentioned in Section 2.1, E and E pos correspond to x α fp and x α fp E, respectively. Each operation in the image encryption generally sacrifices their correspondence. Accordingly, the common image-encryption methods significantly degrade the classification accuracy. In contrast, an image-encryption method based on the EtC system is compatible with each parameter of ViT. Taking advantage of this compatibility, we proposed a model-encryption method for ViT without any degradation to the classification accuracy caused by image encryption [17]. Our previous method demonstrated that the classification accuracy was never affected by encryption [25].
We detail each operation in the model encryption below. Hereafter, $E_{sb}^{(\delta)} \in \mathbb{R}^{N_{sb} \times S_{sb} \times S_{sb} \times 3 \times D}$, where $\delta \in \{1, 2, 3, 4\}$, represents a parameter after the $\delta$-th operation to $E$. Further, $E_{sb}(s, h, w, c, d)$ and $E_{sb}^{(\delta)}(s, h, w, c, d)$, where $d \in \{1, 2, \ldots, D\}$, denote the elements of $E_{sb}$ and $E_{sb}^{(\delta)}$, respectively.
Figure 8. Relationship between the divided image and ViT parameter $E$. Blue dots represent single pixels in the segmented image, and green dots represent single rows in $E$ corresponding to blue dots.

Index Translocation in the First Dimension
We first translocate indices in the first dimension of $E_{sb}$. On the basis of Equation (6), the sub-block translocation replaces the indices in the second dimension of $x_{sb}$ with vectors $v_i$ derived using $K_1$. The second dimension of $x_{sb}$ corresponds to the first dimension of $E_{sb}$. Thus, the indices in the first dimension of $E_{sb}$ should be translocated by replacing them with $v_i$ in the same manner, which yields $E_{sb}^{(1)}$.
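The compatibility between the sub-block translocation and this index translocation can be checked on a single main block. The sketch below assumes toy sizes, random data, and key-seeded permutations standing in for the vectors derived from K_1; the helper `embed` is illustrative and writes the product of a flattened patch and E on the sub-block layout.

```python
import numpy as np

rng = np.random.default_rng(0)
N_sb, S_sb, D = 4, 8, 5                          # toy sizes (assumed)
x_sb = rng.random((N_sb, S_sb, S_sb, 3))         # one main block
E_sb = rng.standard_normal((N_sb, S_sb, S_sb, 3, D))

def embed(x, E):
    """The product x_fp E, summed over sub block, height, width, and channel."""
    return np.einsum("shwc,shwcd->d", x, E)

x_enc = np.empty_like(x_sb)
E_enc = np.empty_like(E_sb)
for c in range(3):
    # Stand-in for the vector derived from K_1 for color channel c.
    v = np.random.default_rng(100 + c).permutation(N_sb)
    x_enc[..., c] = x_sb[..., c][v]              # image side
    E_enc[..., c, :] = E_sb[..., c, :][v]        # model side, same permutation

print(np.allclose(embed(x_enc, E_enc), embed(x_sb, E_sb)))  # True
```

Because each pixel still meets the row of E it was trained with, the embedding of the encrypted block through the encrypted parameter equals the embedding of the plain block through the plain parameter.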

Index Translocation in the Second and Third Dimensions
Next, we translocate indices in the second and third dimensions of E sb (1) . As shown in Equation (8)

Sign Flipping
Here, we flip signs of the elements in $E_{sb}^{(2)}$. As described in Section 3.2.4, the negative-positive transformation with normalization is regarded as an operation to flip or retain the signs of the pixel values in $x_{sb}^{(2)}$. We determine whether to flip or retain the signs of the elements in $E_{sb}^{(2)}$ according to vectors $n_i$ generated using $K_3$. $E_{sb}^{(2)}$ is consequently transformed into $E_{sb}^{(3)}$: the elements that correspond to inverted pixels have their signs flipped, and the others are retained.
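The following sketch checks that flipping the signs of the matching elements of E cancels the negative-positive transformation after normalization. The toy sizes, the random data, and the boolean array `invert` (a stand-in for the vectors derived from K_3) are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N_sb, S_sb, D = 4, 8, 5                          # toy sizes (assumed)
pixels = rng.integers(0, 256, size=(N_sb, S_sb, S_sb, 3)).astype(np.float64)
E_sb = rng.standard_normal((N_sb, S_sb, S_sb, 3, D))
# One retain/invert decision per sub block and color channel.
invert = rng.integers(0, 2, size=(N_sb, 3)).astype(bool)

def normalize(p, S=255 / 2):
    return (p - 255 / 2) / S

# Image side: negative-positive transformation, then normalization.
mask = invert[:, None, None, :]
x_enc = normalize(np.where(mask, 255 - pixels, pixels))
# Model side: flip the signs of the matching elements of E.
sign = np.where(invert, -1.0, 1.0)
E_enc = E_sb * sign[:, None, None, :, None]

plain = np.einsum("shwc,shwcd->d", normalize(pixels), E_sb)
enc = np.einsum("shwc,shwcd->d", x_enc, E_enc)
print(np.allclose(plain, enc))  # True
```

The two sign flips multiply to one, so the embedding is unchanged.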

Index Translocation in Fourth Dimension
We then translocate indices in the fourth dimension of E sb (3) . As shown in Equations (15)-(17), the color component shuffling translocates the indices in the fifth dimension of x sb(4) on the basis of the vector a derived using K 4 . The fifth dimension of x sb(4) corresponds to the fourth dimension of E sb(3) . We, thus, translocate the indices in the fourth dimension of E sb(3) by using a: and E sb (4)

Row Translocation
Finally, we translocate rows in $E_{pos}$. As shown in Equation (19), the main-block translocation replaces the indices in the first dimension of $x_{sb}^{(5)}$ with vector $k$ obtained by $K_5$. Both $\alpha$ and the first dimension of $x_{sb}^{(5)}$ represent the main-block number, and so the main-block translocation is regarded as an operation to replace $\alpha$ with $k$. To preserve the relationship between $E_{pos}$ and $x_{fp}^{\alpha} E$, the rows in $E_{pos}$ should accordingly be translocated by using $k$ as

$$\hat{E}_{pos}(g, d) = \begin{cases} E_{pos}(g, d) & (g = 1), \\ E_{pos}(k_{g-1} + 1, d) & (\text{otherwise}), \end{cases}$$

where $E_{pos}(g, d)$ and $\hat{E}_{pos}(g, d)$ denote the elements of $E_{pos}$ and $\hat{E}_{pos}$, respectively. Note that $g \in \{1, 2, \ldots, N_{mb} + 1\}$ is an index corresponding to the dimensions of $E_{pos}$ and $\hat{E}_{pos}$.
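The effect of the row translocation can be sketched at the level of z_0. The toy sizes, the random data, the 0-based permutation `k` (a stand-in for the vector derived from K_5), and the direction of the mapping (encrypted main block t is original main block k[t]) are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
N_mb, D = 6, 4                                   # toy sizes (assumed)
tokens = rng.standard_normal((N_mb, D))          # x_fp E for each main block
x_class = rng.standard_normal(D)
E_pos = rng.standard_normal((N_mb + 1, D))
k = rng.permutation(N_mb)                        # stand-in for k from K_5

z0 = np.vstack([x_class, tokens]) + E_pos

# Image side: main block t of the encrypted image is original block k[t].
tokens_enc = tokens[k]
# Model side: move the matching rows of E_pos; row 0 (class token) stays.
E_pos_enc = np.vstack([E_pos[:1], E_pos[1:][k]])
z0_enc = np.vstack([x_class, tokens_enc]) + E_pos_enc

# z0_enc equals z0 with its patch rows reordered by k.
print(np.allclose(z0_enc[0], z0[0]), np.allclose(z0_enc[1:], z0[1:][k]))  # True True
```

The encrypted input to the transformer encoder is thus a row-reordered copy of the plain input with the class-token row in place. Since self-attention treats the rows as a set once the position information has been added, such a reordering is not expected to change the output for the class token.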

Evaluation Metrics
We verified the effectiveness of JPEG compression in terms of compression and classification performance. We calculated the average amount of image data to evaluate the compression performance. In addition, we prepared two metrics to assess the classification performance: the classification accuracy and the change rate. In this paper, the change rate is the percentage of test images for which the classification result for the target image with a target model differs from that for the plain test image with a plain trained model. For instance, the target images and target model can be JPEG-compressed EtC images and an encrypted model, respectively. When the change rate is 0%, both sets of classification results are identical.
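The three metrics can be sketched as follows. The function names are illustrative, and the change rate is assumed to be the percentage of test images whose predicted class differs between the plain and target settings.

```python
def accuracy(predictions, labels):
    """Percentage of images whose predicted class matches the label."""
    hits = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * hits / len(labels)

def change_rate(pred_plain, pred_target):
    """Percentage of images whose predicted class differs between settings."""
    changed = sum(a != b for a, b in zip(pred_plain, pred_target))
    return 100.0 * changed / len(pred_plain)

def bits_per_pixel(n_bytes, height, width):
    """Amount of image data in bits per pixel (bpp)."""
    return 8 * n_bytes / (height * width)

labels      = [0, 1, 2, 3]      # toy ground truth
pred_plain  = [0, 1, 2, 0]      # plain test images, plain model
pred_target = [0, 1, 1, 0]      # e.g., compressed EtC images, encrypted model

print(accuracy(pred_plain, labels))           # 75.0
print(accuracy(pred_target, labels))          # 50.0
print(change_rate(pred_plain, pred_target))   # 25.0
print(bits_per_pixel(224 * 224 * 3, 224, 224))  # 24.0 for uncompressed RGB
```

The change rate counts changes in either direction, so it can be non-zero even when the two accuracies happen to be equal.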
For scheme A, shown in Figure 3a, we provide five patterns for the quality factor (Q): 100, 95, 90, 85, and 80. To compare the effects of JPEG compression, each metric was also calculated for EtC images compressed by linear quantization. In comparison, scheme B, shown in Figure 3b, compressed both training images and EtC images by using JPEG with Q = 85. In common with scheme A, the classification accuracy was also calculated for the case of using linear quantization. Hereafter, the EtC images and the training images after the linear quantization are called quantized EtC images and quantized training images, respectively.

Experiments
In this section, the effects of JPEG compression are examined in terms of classification and compression performance by using the metrics described in Section 3.4.

Experimental Setup
We used the CIFAR-10 dataset with 10 classes in this experiment. This dataset consists of 50,000 training images and 10,000 test images. All image sizes are 32 × 32 pixels, while we preliminarily resized each image to 224 × 224 pixels by using the bicubic interpolation method. All training and test images were stored in PPM format.
The ViT model is trained through two phases: pre-training and fine-tuning. In this experiment, we used a pre-trained ViT model using ImageNet-21k with a patch size P = 16. We then fine-tuned the pre-trained ViT model by using plain training images for scheme A or JPEG training images for scheme B. In both schemes, the ViT model was fine-tuned with a learning rate of 0.03 and an epoch of 5000.
In the image encryption, the main-block size S mb was defined as 16, which was the same as P, while the sub-block size S sb was set to 8 or 16. Additionally, as mentioned in Section 3.2, we could choose either the common or independent transformation in regard to color components. Consequently, four types of EtC images were generated for each test image. Figure 9 shows EtC, JPEG-compressed EtC and quantized EtC images for a single test image. Note that we used 4:2:0 downsampling for the JPEG compression.  Table 1 shows the average amount of data in the JPEG-compressed EtC images and the quantized EtC images. This table also includes the average amount of data in the EtC images without compression and in the plain test images with and without compression. After the linear quantization, pixel values of each color component are represented by a single bit, and so the average amount of image data is 3 bpp. This table indicates that JPEG compression with Q ≤ 95 reduced a larger amount of data than linear quantization. We also found that the JPEG-compressed EtC images with S sb = 16 and common transformation had an analogous amount of data to the plain test images with JPEG compression at each value of Q. Table 2 summarizes the classification accuracy and change rate for scheme A. For comparison, this table also gives the results for the quantized EtC images through the encrypted model and for the EtC images without compression through the encrypted model. Note that the latter results could be obtained by our previous method [17]. This table also provides the results for the plain test images with and without compression through the plain model. The change rate is calculated on the basis of the classification results for the plain test images without compression through the plain model. With each value of Q, the classification accuracy and change rate for any encryption pattern were nearly equal to those obtained by using the plain test images and model. 
It is also clear that JPEG compression for the EtC images preserved a significantly high classification accuracy with a low change rate in any case, while the linear quantization sacrificed the accuracy in return for data reduction. For scheme A, the lowest classification accuracy and highest change rate were obtained in the case of Q = 80, S sb = 8, and independent transformation. Even with this pattern, the classification accuracy was still 97.67%, and the change rate was still low at 1.94%. Table 3 shows the classification accuracy for scheme B with Q = 85. Here, the model was trained with JPEG training images. In this table, we include the results for the plain test images with JPEG compression through the plain model. For further comparison, this table also includes the results obtained by using linear quantization. In this case, the model was trained with quantized training images. As shown in this table, JPEG compression for both the training images and the EtC images hardly degraded the classification accuracy, while the linear quantization still substantially decreased the accuracy.

Experimental Results
Comparing scheme B with scheme A at Q = 85 in Table 2, the classification accuracy for the JPEG-compressed EtC images was slightly improved by using the encrypted model trained with the JPEG training images. Accordingly, the results for schemes A and B show that applying JPEG compression to the training images was modestly effective in improving the classification accuracy for JPEG-compressed EtC images.

Discussion
Here, we discuss the effects of JPEG compression for EtC images. Figure 10 illustrates the compression ratio at each quality factor. This figure is derived from the results in Table 1. Note that the amount of uncompressed EtC-image data is always 24.00 bpp. As shown in Figure 10, the non-encrypted images, i.e., original images, had a comparable performance to the EtC images with S sb = 16 and common transformation. This means that, under suitable conditions, the EtC system does not affect the compression performance. The figure also shows that JPEG compression could reduce the data amount by 75-90% at the highest quality factor, Q = 100. Further, the data amount decreased by more than 90% in the case of Q ≤ 90. These results demonstrate that JPEG compression can significantly reduce the amount of EtC-image data.
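The relation between an average data amount in bpp and the reduction percentages quoted above is simple arithmetic, sketched below. The function names are our own, and the only values used are those stated in the text (24 bpp for uncompressed images and 3 bpp after linear quantization); no entries of Table 1 are reproduced.

```python
UNCOMPRESSED_BPP = 24.0  # 8 bits x 3 color components, as stated in the text


def data_reduction(compressed_bpp, original_bpp=UNCOMPRESSED_BPP):
    """Return the fraction of data removed, e.g. 0.9 for a 90% reduction."""
    if original_bpp <= 0:
        raise ValueError("original_bpp must be positive")
    return 1.0 - compressed_bpp / original_bpp


def max_bpp_for_reduction(target_reduction, original_bpp=UNCOMPRESSED_BPP):
    """Return the largest average bpp that still reaches the target reduction."""
    if not 0.0 <= target_reduction <= 1.0:
        raise ValueError("target_reduction must lie in [0, 1]")
    return original_bpp * (1.0 - target_reduction)


# Linear quantization to one bit per color component gives 3 bpp.
quantized_reduction = data_reduction(3.0)

# A reduction of more than 90% corresponds to an average below this bpp.
bpp_threshold_90 = max_bpp_for_reduction(0.9)

print(f"linear quantization: {quantized_reduction:.1%} reduction")
print(f"90% reduction threshold: {bpp_threshold_90:.2f} bpp")
```

Under these definitions, linear quantization corresponds to a fixed 87.5% reduction, and a reduction of more than 90% requires an average of less than 2.4 bpp.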
On the basis of Table 2, we show the degradation in classification accuracy caused by JPEG compression in Figure 11. The negative sign indicates degradation. The maximum degradation in this figure was 1.22% in the case of the independent transformation with S sb = 8 and Q = 80. Thus, JPEG compression in practical use causes little degradation to the classification accuracy. We can conclude that JPEG compression is effective in drastically reducing the amount of EtC-image data while preserving high classification accuracy.
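As a minimal sketch, the quantities discussed here could be computed from predicted labels as follows. The degradation follows the sign convention of Figure 11 (negative means worse than the reference). The change rate is an assumption on our part: we take it to be the percentage of test images whose predicted class differs from the prediction for the uncompressed plain image through the plain model. All function names and the toy data are hypothetical.

```python
def accuracy(predictions, labels):
    """Percentage of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def degradation(acc_compressed, acc_reference):
    """Accuracy difference in percentage points; negative means degradation."""
    return acc_compressed - acc_reference


def change_rate(predictions, reference_predictions):
    """Percentage of images whose predicted class differs from the reference.

    Assumed definition, not taken verbatim from the paper.
    """
    if len(predictions) != len(reference_predictions) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    changed = sum(p != r for p, r in zip(predictions, reference_predictions))
    return 100.0 * changed / len(predictions)


# Toy data (not experimental results): four test images.
labels = [0, 1, 2, 3]
reference = [0, 1, 2, 3]   # plain images, plain model, no compression
compressed = [0, 1, 2, 0]  # JPEG-compressed EtC images, encrypted model

acc_ref = accuracy(reference, labels)
acc_cmp = accuracy(compressed, labels)
print(degradation(acc_cmp, acc_ref), change_rate(compressed, reference))
```

Note that the two measures need not coincide in general: a prediction can change from one wrong class to another, which raises the change rate without altering the accuracy.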
JPEG compression has an option to not downsample the chrominance component. Figure 12 shows the classification accuracy at each quality factor with and without downsampling. Note that S sb is 16 in this figure. We confirmed that the classification accuracies with and without downsampling had similar trends.
We employed the EtC system on the premise of applying JPEG compression. The main-block size in the EtC system was the same as the patch size in ViT. It is important that both the main-block and sub-block sizes are multiples of 8 (or 16 with downsampling) so that they align with the block size of JPEG. Therefore, the main-block and sub-block sizes should be defined on the basis of the block size of JPEG. We confirmed that, when the block-size condition is not satisfied, the classification accuracy and compression performance degrade significantly. In other words, the condition allows us to keep the classification accuracy and compression performance high.
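The block-size condition can be expressed as a small check, sketched below. It encodes only the rule as stated in the text: both sizes must be multiples of 8, or of 16 when chrominance downsampling is used. The function names are hypothetical.

```python
def jpeg_block_unit(chroma_downsampling):
    """Block size the EtC blocks should align with: 8, or 16 with downsampling."""
    return 16 if chroma_downsampling else 8


def satisfies_block_condition(main_block, sub_block, chroma_downsampling):
    """Return True if both EtC block sizes are multiples of the JPEG block unit."""
    if main_block <= 0 or sub_block <= 0:
        raise ValueError("block sizes must be positive")
    unit = jpeg_block_unit(chroma_downsampling)
    return main_block % unit == 0 and sub_block % unit == 0


# Main-block size 16 (equal to the ViT patch size) with sub-block size 16.
print(satisfies_block_condition(16, 16, chroma_downsampling=True))
# A sub-block size of 12 is not a multiple of 8, so the check fails.
print(satisfies_block_condition(16, 12, chroma_downsampling=False))
```

Such a check could be run before encryption to reject block-size settings that would straddle JPEG block boundaries.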
JPEG compression generally eliminates image data in the high-frequency component. Therefore, this study suggests that noise added to the high-frequency component has little effect on ViT classification. Additionally, noise-added encrypted images generally have high robustness against attacks. Thus, JPEG noise is also expected to enhance the robustness of EtC images against attacks.

Conclusions
We investigated the effects of JPEG compression for EtC images on classification results using ViT. JPEG compression never caused severe degradation to the classification accuracy for EtC images; the maximum degradation was 1.22% even when the quality factor was 80. Additionally, the data amount of EtC images was reduced by more than 90% at that quality factor. These results demonstrated that JPEG compression for EtC images not only drastically reduced the amount of data but also caused little degradation to the classification accuracy. Further, JPEG compression for plain training images was marginally effective in improving the classification accuracy. Compared with linear quantization, JPEG compression was more effective in terms of the classification and compression performance.
This paper suggests that noise added to the high-frequency component not only keeps the classification accuracy high but also enhances the robustness against attacks. However, the relationships between different types of noise and the classification accuracy or robustness have not been studied in detail. In future work, we will investigate these relationships for more reliable and robust image classification.